11 research outputs found

    Terminology Integration in Statistical Machine Translation

    The electronic version does not contain the appendices. The doctoral thesis describes methods and tools researched and developed by the author for integrating bilingual terminology into statistical machine translation (SMT) systems. The author presents novel methods for terminology integration in SMT systems during training (through static integration) and during translation (through dynamic integration). The work focuses not only on the SMT integration techniques, but also on methods for acquiring the linguistic resources needed for the various tasks in terminology integration workflows for SMT. The proposed methods have been evaluated in automatic and manual evaluation experiments. The results show that both static and dynamic integration methods substantially improve translation quality. The results described in the thesis have been validated in several research projects and implemented in practical solutions. Keywords: statistical machine translation, terminology, cross-lingual information extraction

    Terminology localization guidelines for the national scenario

    This is a preprint of the paper accepted to LREC 2014: The 9th edition of the Language Resources and Evaluation Conference, scheduled for May 28, 2014 - May 30, 2014 in Reykjavik (Iceland). The paper presents a set of principles and practical guidelines for terminology work in the national scenario to ensure a harmonized approach to term localization. These linguistic principles and guidelines were elaborated by the Terminology Commission in Latvia for the domain of Information and Communication Technology (ICT). We also present a novel approach to the corpus-based selection and evaluation of the most frequently used terms. Analysis of the terms shows that, in general, localized terms in the normative terminology work in Latvia are coined according to these guidelines. We further evaluate how terms included in the database of official terminology are adopted in general use, such as newspaper articles, blogs, forums, and websites. Our evaluation shows that in a non-normative context the official terminology faces strong competition from other variants of localized terms. Conclusions and recommendations from a lexical analysis of localized terms are provided. We hope that the presented guidelines and evaluation approach will be useful to terminology institutions, regulatory authorities, and researchers in different countries who are involved in national terminology work. The research leading to these results has received funding from the research project “Optimization methods of large scale statistical models for innovative machine translation technologies” of the European Regional Development Fund, contract no. 2013/0038/2DP/2.1.1.1.0/13/APIA/VIAA/029

    Open Terminology Management and Sharing Toolkit for Federation of Terminology Databases

    Consolidated access to current and reliable terms from different subject fields and languages is necessary for content creators and translators. Terminology is also needed in AI applications such as machine translation, speech recognition, information extraction, and other natural language processing tools. In this work, we facilitate standards-based sharing and management of terminology resources by providing an open terminology management solution, the EuroTermBank Toolkit. It allows organisations to manage and search their terms, create term collections, and share them within and outside the organisation by participating in the network of federated databases. The data curated in the federated databases are automatically shared with EuroTermBank, the largest multilingual terminology resource in Europe, giving translators, language service providers, researchers, and students access to terminology resources in their most current version. Comment: LREC 202

    Informācijas tehnoloģijas terminu lietošanas paradumi publiskajā saziņā

    Information and communication technology (ICT) terms are mainly created in English and then localized into other languages. Because languages differ in morphology and in term-formation traditions, such localization tends to be rather chaotic. Latvian ICT term localizers have developed a fairly strict, so-called quasi-algorithmic approach, which they have tested over more than 15 years. This article demonstrates the viability of the approach using the most frequently used terms as examples. It rejects the widespread view that localized ICT terms pollute the Latvian language with foreign words. It also analyses the usability of officially approved terms in texts and the reasons why they sometimes meet resistance from everyday users

    Deep dive machine translation

    Machine Translation (MT) is one of the oldest language technologies, having been researched for more than 70 years. However, it is only during the last decade that it has been widely accepted by the general public, to the point where in many cases it has become an indispensable tool for the global community, supporting communication between nations and lowering language barriers. Still, major gaps remain in the technology that need addressing before it can be applied successfully in under-resourced settings, understand context, and use world knowledge. This chapter provides an overview of the current state of the art in the field of MT, offers technical and scientific forecasting for 2030, and provides recommendations for the advancement of MT as a critical technology if the goal of digital language equality in Europe is to be achieved

    Development of a concatenative Latvian language speech synthesis system

    The Bachelor paper describes a concatenative Latvian language speech synthesis system developed by the author. As part of the work, the author developed the system architecture and a speech synthesis library that provides the functionality needed to integrate the system into external solutions. The paper describes how an audio file containing the acoustic representation of the original text, that is, speech, is obtained from a text in Latvian. It also examines how the required audio file is assembled from individual audio fragments combined using different concatenation methods

    Usage habits of information technology terminology in public communication

    The article has been submitted for publication in the proceedings of the 3rd Drezen conference. Information and communication technology (ICT) terms are mainly created in English and then localized into other languages. Because languages differ in morphology and in term-formation traditions, such localization tends to be rather chaotic. Latvian ICT term localizers have developed a fairly strict, so-called quasi-algorithmic approach, which they have tested over more than 15 years. This article demonstrates the viability of the approach using the most frequently used terms as examples. It rejects the widespread view that localized ICT terms pollute the Latvian language with foreign words. It also analyses the usability of officially approved terms in texts and the reasons why they sometimes meet resistance from everyday users